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Abstract 

Nonnegative matrix factorization (NMF) has become a ubiquitous tool for 
data analysis. An important variant is the sparse NMF problem which 
arises when we explicitly require the learnt features to be sparse. A natural 
measure of sparsity is the Lo norm, however its optimization is NP-hard. 
Mixed norms, such as L1/L2 measure, have been shown to model sparsity 
robustly, based on intuitive attributes that such measures need to satisfy. 
This is in contrast to computationally cheaper alternatives such as the plain 
Li norm. However, present algorithms designed for optimizing the mixed 
norm Li /L2 are slow and other formulations for sparse NMF have been pro- 
posed such as those based on Li and Lq norms. Our proposed algorithm 
allows us to solve the mixed norm sparsity constraints while not sacrificing 
computation time. We present experimental evidence on real-world datasets 
that shows our new algorithm performs an order of magnitude faster com- 
pared to the current state-of-the-art solvers optimizing the mixed norm and 
is suitable for large-scale datasets. 



1 Introduction 

Matrix factorization arises in a wide range of application domains and is useful for extracting 
the latent features in the dataset (Figure 1). In particular, we are interested in matrix 
factorizations which impose the following requirements: 

• nonnegativity 

• low-rankedness 

• sparsity 

Normegativity is a natural constraint when modeling data with physical constraints such as 
chemical concentrations in solutions, pixel intensities in images and radiation dosages for 
cancer treatment. Low-rankedness is useful for learning a lower dimensionality represen- 
tation. Sparsity is useful for modeling the conciseness of the representation or that of the 
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latent features. Imposing all these requirements on our matrix factorization leads to the 
sparse nonnegative matrix factorization (SNMF) problem. 

SNMF enjoys quite a few formulations [2, 14, 13, 11, 24, 17, 25, 26] with successful applica- 
tions to single-channel speech separation [27] and micro-array data analysis [17, 25]. 

However, algorithms [14, 11] for solving SNMF which utilize the mixed norm of L1/L2 as 
their sparsity measure are slow and do not scale well to large datasets. Thus, we develop 
an efficient algorithm to solve this problem and has the following ingredients: 

• A theoretically efficient projection operator (0(m log m)) to enforce the user-defined 
sparsity where m is the dimensionality of the feature vector as opposed to the 
previous approach [14]. 

• Novel sequential updates which provide the bulk of our speedup compared to the 
previously employed batch methods [14, 11]. 
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Figure 1: (Left) Features learned from the ORL dataset^with various matrix factorization 
methods such as principal component analysis (PCA), independent component analysis 
(ICA), and dictionary learning. The relative merit of the various matrix factorizations 
depends on both the signal domain and the target application of interest. (Right) Features 
learned under the sparse NMF formulation where roughly half the features were constrained 
to lie in the interval [0.2, 0.4] and the rest are fixed to sparsity value 0.7. This illustrates 
the flexibility that the user has in fine tuning the feature sparsity based on prior domain 
knowledge. White pixels in this figure correspond to the zeros in the features. 



2 Preliminaries and Previous Work 

In this section, we give an introduction to the nonnegative matrix factorization (NMF) and 
SNMF problems. Also, we discuss some widely used algorithms from the literature to solve 
them. 

Both these problems share the following problem and solution structure. At a high-level, 
given a nonnegative matrix X of size m x n, we want to approximate it with a product of 
two nonnegative matrices W, H of sizes m x r and r x n, respectively: 

(1) X « WH. 

The nonnegative constraint on matrix H makes the representation a conical combination 
of features given by the columns of matrix W. In particular, NMF can result in sparse 
representations, or a parts-based representation, unlike other factorization techniques such 
as principal component analysis (PCA) and vector quantization (VQ). A common theme in 

^Scikit-learn package was used in generating the figure. 
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the algorithms proposed for solving these problems is the use of alternating updates to the 
matrix factors, which is natural because the objective function to be minimized is convex 
in W and in H, separately, but not in both together. Much effort has been focused on 
optimizing the efficiency of the core step of updating one of W, H while the other stays 
fixed. 



2.1 Nonnegative Matrix Factorization 

Factoring a matrix, all of whose entries are nonnegative, as a product of two low-rank non- 
negative factors is a fundamental algorithmic challenge. This has arisen naturally in diverse 
areas such as image analysis [20], micro-array data analysis [17], document clustering [31], 
chemometrics [19], information retrieval [12] and biology applications [4]. For further appli- 
cations, see the references in the following papers [1, 7]. 

We will consider the following version of the NMF problem, which measures the reconstruc- 
tion error using the Frobenius norm [21]: 

(2) mini||X-WH||2, s.t. W > 0, H > 0, ||W^,||2 = 1, Vj e {1, • • • ,r} 

where > is element-wise. We use subscripts to denote column elements. Simple multiplica- 
tive updates were proposed by Lee and Seung to solve the NMF problem. This is attractive 
for the following reasons: 

• Unlike additive gradient descent methods, there is no arbitrary learning rate pa- 
rameter that needs to be set. 

• The nonnegativity constraint is satisfied automatically, without any additional pro- 
jection step. 

• The objective function converges to a limit point and the values are non-increasing 
across the updates, as shown by Lee and Seung [21]. 

Algorithm 1 is an example of the kind of multiplicative update procedure used, for instance, 
by Lee and Seung [21]. The algorithm alternates between updating the matrices W and H 
(we have only shown the updates for H — those for W are analogous). 



Algorithm 1 nnls-mult(X, W, H) 
1: repeat 

2: H = H -v^^H • 
3: until convergence 
4: Output: Matrix H. 



Here, indicates element-wise (Hadamard) product and matrix division is also element- 
wise. To remove the scaling ambiguity, the norm of columns of matrix W are set to unity. 
Also, a small constant, say 10~^, is added to the denominator in the updates to avoid 
division by zero. 

Besides multiplicative updates, other algorithms have been proposed to solve the NMF 
problem based on projected gradient [22], block pivoting [18], sequential constrained opti- 
mization [6] and greedy coordinate-descent [15]. 

2.2 Sparse Nonnegative Matrix Factorization 

The nonnegative decomposition is in general not unique [9] . Furthermore, the features may 
not be parts-based if the data resides well inside the positive orthant. To address these 
issues, sparseness constraints have been imposed on the NMF problem. 

Sparse NMF can be formulated in many different ways. From a user point of view, we can 
split them into two classes of formulations: explicit and implicit. In explicit versions of 
SNMF [14, 11], one can set the sparsities of the matrix factors W, H directly. On the other 
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hand, in implicit versions of SNMF [17, 25], the sparsity is controlled via a regularization 
parameter and is often hard to tune to specified sparsity values a priori. However, the 
algorithms for implicit versions tend to be faster compared to the explicit versions of SNMF. 

In this paper, we consider the explicit sparse NMF formulation proposed by Hoyer [14]. 
To make the presentation easier to follow, we first consider the case where the sparsity is 
imposed on one of the matrix factors, namely the feature matrix W — the analysis for the 
symmetric case where the sparsity is instead set on the other matrix factor H is analogous. 
The case where sparsity requirements are imposed on both the matrix factors is dealt with in 
the Appendix. The sparse NMF problem formulated by Hoyer [14] with sparsity on matrix 
W is as follows: 

mm f(W,H) =-j|X - WHlll s.t. W > 0,H > 0, 

(3) ||W,H2 = l,sp(W,) = a, Vj€{l,--- ,r} 
Sparsity measure for a d-dimensional vector x is given by: 

(4) sp(.) 

The sparsity measure (4) defined above has many appealing qualities. Some of which are as 
follows: 



• The measure closely models the intuitive notion of sparsity as captured by the Lq 
norm. So, it easy for the user to specify sparsity constraints from prior knowledge 
of the application domain. 

• Simultaneously, it is able to avoid the pitfalls associated with directly optimizing 
the Lq norm. Desirable properties for sparsity measures have been previously ex- 
plored [16] and it satisfies all of these properties for our problem formulation. The 
properties can be briefly summarized as: (a) Robin Hood — Spreading the energy 
from larger coordinates to smaller ones decreases sparsity, (b) Scaling — Sparsity 
is invariant to scaling, (c) Rising tide — Adding a constant to the coordinates 
decreases sparsity, (d) Cloning — Sparsity is invariant to cloning, (e) Bill Gates 
— One big coordinate can increase sparsity, (f) Babies — coordinates with zeros 
increase sparsity. 

• The above sparsity measure enables one to limit the sparsity for each feature to lie 
in a given range by changing the equality constraints in the SNMF formulation (3) 
to inequality constraints [11]. This could be useful in scenarios like fMRI brain 
analysis, where one would like to model the prior knowledge such as sizes of artifacts 
are different from that of the brain signals. A sample illustration on a face dataset 
is shown in Figure 1 (Right). The features are now evenly split into two groups of 
local and global features by choosing two different intervals of sparsity. 

A gradient descent-based algorithm called Nonncgativc Matrix Factorization with Sparse- 
ness Constraints (NMFSC) to solve SNMF was proposed [14]. Multiplicative updates 
were used for optimizing the matrix factor which did not have sparsity constraints spec- 
ified. Heiler and Schn6rr[ll] proposed two new algorithms which also solved this prob- 
lem by sequential cone programming and utilized general purpose solvers like MOSEK 
(http://www.mosek.com). We will consider the faster one of these called tangent-plane 
constraint (TPC) algorithm. However, both these algorithms, namely NMFSC and TPC, 
solve for the whole matrix of coeSicients at once. In contrast, we propose a block coordinate- 
descent strategy which considers a sequence of vector problems where each one can be solved 
in closed form efficiently. 



3 The Sequential Sparse NMF Algorithm 

We present our algorithm which we call Sequential Sparse NMF (SSNMF) to solve the 
SNMF problem as foUows: 
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First, we consider a problem of special form which is the building block (Algorithm 2) of our 
SSNMF algorithm and give an efficient, as well as exact, algorithm to solve it. Second, we 
describe our sequential approach (Algorithm 3) to solve the subproblem of SNMF. This uses 
the routine we developed in the previous step. Finally, we combine our routines developed 
in the previous two steps along with standard solvers (for instance Algorithm 1) to complete 
the SSNMF Algorithm (Algorithm 4). 



3.1 Sparse-opt 

Sparse-opt routine solves the following subproblem which arises when solving problem (3): 
(5) maxti^y s.t. ||y||i = fc, ||y||2 = 1 

where vector b is of size m. This problem has been previously considered [14], and an 
algorithm to solve it was proposed which we will henceforth refer to as the Projection- 
Hoycr. Similar projection problems have been recently considered in the literature and 
solved efficiently [10, 5]. 

Observation 1. For any we have that if hi > bj, then yi > yj. 

Let us first consider the case when the vector b is sorted. Then by the previous observation, 
we have a transition point p that separates the zeros of the solution vector from the rest. 

Observation 2. By applying the Cauchy-Schwarz inequality on y and the all ones vector, 
we get p> k^. 

The Lagrangian of the problem (5) is : 

Setting the partial derivatives of the Lagrangian to zero, we get by observation 1: 

^Vi = k,^yl = 1 

i i 

bi + li{p) + \{p)yi = 0, Vi G {1, 2, • • • ,p} 
7i = 0,Vze {I,-- - ,p} 

= o,Vi e {p + 1, • • • ,to} 

where wc account for the dependence of the Lagrange parameters A, /i on the transition 
point p. We compute the objective value of problem (5) for all transition points p in 
the range from fc^ to m and select the one with the highest value. In the case, where 
the vector b is not sorted, we just simply sort it and note down the sorting permutation 
vector. The complete algorithm is given in Algorithm 2. The dominant contribution to the 
running time of Algorithm 2 is the sorting of vector b and therefore can be implemented 
in 0(m log m) time''. Contrast this with the running time of Projection-Hoyer whose worst 
case is Oini^) [14, 28]. 



3.2 Sequential Approach — Block Coordinate Descent 

Previous approaches for solving SNMF [14, 11] use batch methods to solve for sparsity 
constraints. That is, the whole matrix is updated at once and projected to satisfy the con- 
straints. We take a different approach of updating a column vector at a time. This gives us 
the benefit of being able to solve the subproblem (column) efficiently and exactly. Subse- 
quent updates can benefit from the newly updated columns resulting in faster convergence 
as seen in the experiments. 

^This can be further reduced to linear time by noting that we do not need to fully sort the input 
in order to find p*. 
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Algorithm 2 Sparse-opt(6, A:) 

1: Set a = sort(6) and p* = m. Get a mapping tt such that = and aj > aj+i for 

all valid j. 
2: Compute values of ii{p),\{p) as follows: 
3: for p e { [fc^] ,m} do 

5, p(p) = -Sjiiil - 4a(p) 

6: if a{p) < — then 
7: p* = p - 1 
8: break 
9: end if 
10: end for 

11: Set Xi = — ^^X(j^y^'^* € {1, • • • ,p*} and to zero otherwise. 
12: Output: Solution vector y where = Xi. 



In particular, consider the optimization problem (3) for a column j of the matrix W while 
fixing the rest of the elements of matrices W, H: 

min /(W,-) = IgWWjWl+u'W, s.t. ||W,||2 = 1, ||W,||i = k 

where g = Hj Hj and u = —XHj + J2i^j Wi{HEr)ij. This reduces to the problem (5) for 
which we have proposed an exact algorithm (Algorithm 2). We update the columns of the 
matrix factor W sequentially as shown in Algorithm 3. We call it sequential for we update 

the columns one at a time. Note that this approach can be scon as an instance of block 
coordinate descent methods by mapping features to blocks and the Sparse-opt projection 
operator to a descent step. 



Algorithm 3 sequential-pass(X, W, H) 

1: C = -XH^+WHH^ 

2: G = HH^ 
3: repeat 

4: for j = 1 to r (randomly) do 

5: \J,=C,-W,G,, 

6: t = Sparse-opt (—Uj, fc). 

7: C = C + {t-Wj)G] 

8: Wj = t. 

9: end for 
10: until convergence 
11: Output: Matrix W. 



3.3 SSNMF Algorithm for Sparse NMF 

We are now in a position to present our complete Sequential Sparse NMF (SSNMF) algo- 
rithm. By combining Algorithms 1, 2 and 3, we obtain SSNMF (Algorithm 4). 



Algorithm 4 ssnmf(X, W, H) 
1: repeat 

2: W = sequential-pass(X, W, H) 
3: H nnls-mult(X, W, H) 
4: until convergence 
5: Output: Matrices W, H. 
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4 Implementation Issues 



For clarity of exposition, wc presented the plain vanilla version of our SSNMF Algorithm 4. 
We now describe some of the actual implementation details. 

• Initialization: Generate a positive random vector v of size m and obtain z = 
Sparse-opt(w, fc) where k = \pm — a\/m — 1 (from equation (4)). Use the solu- 
tion z and its random permutations to initialize matrix W. Initialize the matrix H 

to uniform random entries in [0, 1]. 

• Incorporating faster solvers: We use multiplicative updates for a fair comparison 
with NMFSC and TPC. However, we can use other NNLS solvers [22, 18, 6, 15] 
to solve for matrix H. Empirical results (not reported here) show that this further 
speeds up the SSNMF algorithm. 

• Termination: In our experiments, we fix the number of alternate updates or equiv- 

alently the number of times we update matrix W . Other approaches include spec- 
ifying total running time, relative change in objective value between iterations or 
approximate satisfaction of KKT conditions. 

• Sparsity constraints: We have primarly considered the sparse NMF model as for- 
mulated by Hoyer [14]. This has been generalized by Heiler and Schnorr [11] by 
relaxing the sparsity constraints to lie in user-defined intervals. Note that, we can 
handle this formulation [11] by making a trivial change to Algorithm 3. 

5 Experiments and Discussion 

In this section, we compare the performance of our algorithm with the state-of-the-art 
NMFSC and TPC algorithms [14, 11]. Running times for the algorithms are presented when 
applied to one synthetic and three real- world datasets. Experiments report reconstruction 
error (||X— WH||i?) instead of objective value for convenience of display. For all experiments 
on the datasets, we ensure that our final reconstruction error is always better than that of 
the other two algorithms. Our algorithm was implemented in MATLAB (http://www. 
mathworks . com) similar to NMFSC and TPC. All of our experiments were run on a 3.2Ghz 
Intel machine with 24GB of RAM and the number of threads set to one. 

5.1 Datcisets 

For comparing the performance of SSNMF with NMFSC and TPC, we consider the following 
synthetic and three real- world datasets : 

• Synthetic: 200 images of size 9 x 9 as provided by Heiler and Schnorr [11] in their 

code implementation. 

• CBCL face dataset consists of 2429 images of size 19 x 19 and can be obtained at 
http : //cbcl .mit . edu/cbcl/sof tware-datasets/FaceData2 .html. 

• ORL: Face dataset that consists of 400 images of size 112 x 92 and can be ob- 
tained at http: //www. cl . cam. ac.uk/research/ dtg/ attar chive/facedatabase . 
html, http://www.kyb.tuebingen.mpg.de/bethge/vanhateren/iml/. We use 
400 of these images in our experiments. 

• sMRI: Structural MRI scans of 269 subjects taken at the John Hopkins University 
were obtained. The scans were taken on a single 1.5T scanner with the imaging 
parameters set to 35mm TR, 5ms TE, matrix size of 256 x 256. We segment 
these images into gray matter, white matter and cerebral spinal fluid images, using 
the software program SPM5 (http://www.fil.ion.ucl.ac.uk/spm/software/spm5/), 
followed by spatial smoothing with a Gaussian kernel of 10 x 10 x 10 mm. This 
results in images which are of size 105 x 127 x 46 
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Figure 2: Mean running times for Sparse-opt and the Project ion- Hoyer are presented for 
random problems. The x-axis plots the dimension of the problem while the y-axis has the 
running time in seconds. Each of the subfigures corresponds to a single sparsity value in 
{0.2,0.4,0.6,0.8}. Each datapoint corresponds to the mean running time averaged over 40 
runs for random problems of the same fixed dimension. 
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Figure 3: Running times for SSNMF and NMFSC and TPC algorithms on the synthetic 
dataset where the sparsity values range from 0.2 to 0.8 and number of features is 5. Note 
that SSNMF and NMFSC are over an order of magnitude faster than TPC. 

5.2 Comparing Performances of Core Updates 

We compare our Sparse-opt (Algorithm 2) routine with the competing Projection- 
Hoyer [14]. In particular, we generate 40 random problems for each sparsity constraint 
in {0.2,0.4,0.6,0.8} and a fixed problem size. The problems are of size 2' x 100 where 
i takes integer values from to 12. Input coefficients are generated by drawing samples 
uniformly at random from [0, 1]. The mean values of the running times for Sparse-opt and 
the Projection-Hoyer for each dimension and corresponding sparsity value are plotted in 
Figure 2. 

We compare SSNMF with SSNMF-|-Proj on the CBCL dataset. The algorithms were run 
with rank set to 49. The running times are shown in Figure 6. We see that in low- dimensional 
datasets, the difference in running times are very small. 

5.3 Comparing Overall Performances 

SSNMF versus NMFSC and TPC: We plot the performance of SSNMF against 
NMFSC and TPC on the synthetic dataset provided by Heiler and Schnorr [11] in Fig- 
ure 3. We used the default settings for both NMFSC and TPC using the software provided 
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Figure 4: Convergence plots for the ORL dataset with sparsity from [0.1, 0.8] for the NMFSC 
and SSNMF algorithms. Note that we are an order of magnitude faster, especially when 
the sparsity is higher. 
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Figure 5: Running times for SSNMF and NMFSC algorithms for the sMRI dataset with 
rank set to 40 and sparsity values of a from 0.1 to 0.8. Note that for higher sparsity values 
we converged to a lower reconstruction error and arc also noticeably faster than the NMFSC 
algorithm. 

by the authors. Our experience with TPC was not encouraging on bigger datasets and hence 
we show its performance only on the synthetic dataset. It is possible that the performance 
of TPC can be improved by changing the default settings but we found it non-trivial to do 
so. 

SSNMF versus NMFSC: To ensure fairness, we removed logging information from 
NMFSC code [14] and only computed the objective for equivalent number of matrix updates 
as SSNMF. We do not plot the objective values at the first iteration for convenience of 
display. However, they are the same for both algorithms because of the shared initialization 
. We ran the SSNMF and NMFSC on the ORL face dataset. The rank was fixed at 25 
in both the algorithms. Also, the plots of running times versus objective values are shown 
in Figure 4 corresponding to sparsity values ranging from 0.1 to 0.7. Additionally, we ran 
our SSNMF algorithm and NMFSC algorithm on a large-scale dataset consisting of the 
structural MRI images by setting the rank to 40. The running times are shown in Figure 5. 
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time (seconds) time (seconds) time (seconds) time (seconds) 

Figure 6: Running times for SSNMF and SSNMF+Proj algorithms for the CBCL face 
dataset with rank set to 49 and sparsity values ranging from 0.2 to 0.9 

5.4 Main Results 

We compared the running times of our Sparse-opt routine versus the Projection-Hoyer and 
found that on the synthetically generated datasets we are faster on average. 

Our results on switching the Sparse-opt routine with the Projection-Hoyer did not slow 
down our SSNMF solver significantly for the datasets we considered. So, we conclude that 
the speedup is mainly due to the sequential nature of the updates (Algorithm 3). 

Also, we converge faster than NMFSC for fewer number of matrix updates. This can be 
seen by noting that the plotted points in Figures 4 and 5 are such that the number of matrix 
updates are the same for both SSNMF and NMFSC. For some datasets, we noted a speedup 
of an order of magnitude making our approach attractive for computation purposes. 

Finally, we note that we recover a parts-based representation as shown by Hoyer [14]. An 
example of the obtained features by NMFSC and ours is shown in Figure 7. 
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(c) sparsity 0.75 



Figure 7: Feature sets from NMFSC algorithm (Left) and SSNMF algorithm (Right) using 
the ORL face dataset for each sparsity value of a in {0.5,0.6,0.75}. Note that SSNMF 
algorithm gives a parts-based representation similar to the one recovered by NMFSC. 



6 Connections to Related Work 

Other SNMF formulations have been considered by Hoyer [13], M0rup et al. [24] , Kim 
and Park [17], Pascual-Montano et al. [25] (nsNMF) and Peharz and Pernkopf [26]. SNMF 
formulations using similar sparsity measures as used in this paper have been considered for 
applications in speech and audio recordings [30, 29]. 

We note that our sparsity measure has all the desirable properties, extensively discussed 
by Hurley and Rickard [16], except for one ("cloning"). Cloning property is satisfied when 
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two vectors of same sparsity when concatenated maintain their sparsity value. Dimensions 
in our optimization problem are fixed and thus violating the cloning property is not an 
issue. Compare this with the Li norm that satisfies only one of these properties (namely 
"rising tide"). Rising tide is the property where adding a constant to the elements of a 
vector decreases the sparsity of the vc;c;tor. Nevertheless, the measure used in Kim and Park 
is based on the Li norm. The properties satisfied by the measure in Pascual-Montano et al. 
are unclear because of the implicit nature of the sparsity formulation. 

Pascual-Montano et al. [25] claim that the SNMF formulation of Hoyer, as given by prob- 
lem (3) does not capture the variance in the data. However, some transformation of the 
sparsity values is required to properly compare the two formulations [14, 25]. Preliminary 
results show that the formulation given by Hoyer [14] is able to capture the variance in the 
data if the sparsity parameters arc set appropriately. Peharz and Pernkopf [26] propose 
to tackle the Lq norm constrained NMF directly by projecting from intermediate uncon- 
strained solutions to the required Lq constraint. This leads to the well-known problem of 
getting stuck in local minima. Indeed, the authors re-initialize their feature matrix with 
an NNLS solver to recover from the local suboptimum. Our formulation avoids the local 
minima associated with Lq norm by using a smooth surrogate. 

7 Conclusions 

We have proposed a new efficient algorithm to solve the sparse NMF problem. Experiments 
demonstrate the effectiveness of our approach on real datasets of practical interest. Our 
algorithm is faster over a range of sparsity values and generally performs better when the 
sparsity is higher. The speed up is mainly because of the sequential nature of the updates 
in contrast to the previously employed batch updates of Hoyer. Also, we presented an exact 
and efficient algorithm to solve the problem of maximizing a linear objective with a sparsity 
constraint, which is an improvement over the heuristic approach in Hoyer. 

Our approach can be extended to other NMF variants [13]. Another possible application is 
the sparse version of nonnegative tensor factorization. A different research direction would 
be to scale our algorithm to handle large datasets by chunking [23] and/or take advantage 
of distributed/parallel computational settings [3]. 
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Appendix 

Bi-Sparse NMF 

In some applications, it is desirable to set the sparsity on both matrix factors. However, 
this can lead to the situation where the variance in the data is poorly captured [25]. To 
ameliorate this condition, we formulate it as the following optimization problem and call it 
as Bi-Sparse NMF: 

min J||X-WDH|||, 

s.t.W > 0,H > 0,D > 

||W,|l2 = l,sp(W,) = a,Vje{l,-- - ,r} 
(6) ||Hl2 = l,sp(ff)=/3,V2e{l,--- ,r} 

where D is a r x r matrix. In the above formulation, wc constrain the L2 norms of the 
columns of matrix W to unity. Similarly, wc constrain the L2 norms of rows of matrix H 
to be unity. This scaling is absorbed by the matrix D. Note that this formulation with the 
matrix D constrained to be diagonal is equivalent to the one proposed in Hoyer when both 
the matrix factors have their sparsity specified. 

We can solve for the matrix D with any NNLS solver. A concrete algorithm is the one 
presented in Ding et al. and is reproduced here for convenience (Algorithm 5). If D is a 
diagonal matrix, we only update the diagonal terms and maintain the rest at zero. Algo- 
rithms 1 and 5 can be sped up by pre-computing the matrix products which are unchanged 
during the iterations. 

Also, the matrix D captures the variance of the dataset when we have sparsity set on both 
the matrices W, H. 
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Algorithm 5 Diag-mult(X, W, H, D) 



repeat 

n _ r) W^XH 

^ ~ ^ ^ Wwdhht 
until convergence 

Output: Matrix D. 
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